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Abstract 

Background: Predicting new non-coding RNAs (ncRNAs) of a family can be done by aligning the potential 
candidate with a member of the family with known sequence and secondary structure. Existing tools either only 
consider the sequence similarity or cannot handle local alignment with gaps. 

Results: In this paper, we consider the problem of finding the optimal local structural alignment between a query 
RNA sequence (with known secondary structure) and a target sequence (with unknown secondary structure) with 
the affine gap penalty model. We provide the algorithm to solve the problem. 

Conclusions: Based on an experiment, we show that there are ncRNA families in which considering local structural 
alignment with gap penalty model can identify real hits more effectively than using global alignment or local 
alignment without gap penalty model. 



Background 

A non-coding RNA (ncRNA) is a RNA molecule that 
does not translate into proteins. It has been shown to 
be involved in many biological processes [1-4]. The 
number of ncRNAs within the human genome was 
underestimated before, but recently some databases 
reveal over 212,000 ncRNAs [5] and more than 1,300 
ncRNA families [6]. Large discoveries of ncRNAs and 
their families show the possibilities that ncRNAs may be 
as diverse as protein molecules [7]. Identifying ncRNAs 
is an important problem in biological study. However, it 
is time consuming and there is no effective method to 
identify ncRNAs in a laboratory, predicting ncRNAs 
based on known ncRNAs using comparative computa- 
tional approach is one of the promising directions to 
identify potential candidates for further verification. 

Most of the computational approaches are based on 
the observation that if two different ncRNA molecules 
are in the same family (with similar biological func- 
tions), they usually exhibit similar sequences as well as 
secondary structures. One common approach [8-10] is 
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as follows. We pick an ncRNA member of a family with 
known sequence and secondary structure (referred as 
the query), scan along a genomic sequence and for each 
possible region (referred as the target), perform an 
alignment between the query and the target to obtain a 
similarity measure to decide if the region is a potential 
ncRNA candidate for that family. The similarity measure 
may only base on the sequence or both the sequence 
and secondary structure (the latter case is referred as 
structural alignment). Along this direction, there are 
some approaches [11-14] that make use of secondary 
structure prediction tools to predict the secondary struc- 
ture to be formed by the target assuming that it is an 
ncRNA before performing the alignment. The accuracy 
may, however, depend on the accuracy of the secondary 
structure prediction tools. 

Instead of using one member of a family, some other 
approaches [15] use a set of ncRNAs from the same 
family to train a model (e.g. covariance model). Then, 
using this model to scan a genomic sequence to identify 
potential regions that are ncRNA candidates of that 
family. What information (sequence similarity and/or 
secondary structure) to be captured from the known 
ncRNAs depends on how we define the model. 



o 
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However, in some cases, we may not have enough 
known members in a family to train a model. In this 
paper, we focus on the problem that uses one known 
member as the query and align it with a target 
sequence. We remark that there are also other computa- 
tional methods that identify ncRNAs without using 
known members in a family. For example, some try to 
identify ncRNAs by considering the stability of second- 
ary structures formed by the substrings of a given gen- 
ome [16]. This method may not be very effective 
because a random sequence with high GC composition 
also allows an energetically favorable secondary struc- 
ture [17]. So, the comparative approach we described in 
the above is still one of the most popular approaches. 

The core idea behind all comparative approaches is to 
compute the similarity between the query (known mem- 
ber^)) and the target (each possible region in the geno- 
mic sequence to be investigated). Some only consider 
sequence similarity which may not work well for 
families in which members do not have high sequence 
similarity (e.g. members of RF00017 in Rfam 9.1 [6] 
only have 39% sequence similarity). For example, 
Gotohscan [8] considers semi-global alignment with 
affine gap penalty according to the sequence similarity 
only. For those also consider the similarity of secondary 
structure, they usually require the whole sequence of 
the query to be aligned with the whole sequence of the 
target (referred as global alignment in the community) 
[10]. However, similar to the protein sequence, the 
ncRNAs in the same family may not have similar 
sequence or structure for the whole sequence but only 
for the substrings of them (those supposed to be the 
functional parts), especially when they belong to species 
with long evolutionary distance apart. Figure 1 shows 
one of these examples. It shows the multiple sequence 
alignment between some members of the family 
RF01051 in Rfam 9.1 database. The two circled mem- 
bers (i.e. AAUO01000012 and AAXYO1000014) are not 



quite similar if we consider the global alignment. Also, 
for the subregions that they look similar (i.e. the circled 
region), there exist large insertion/deletion (gaps). There 
are also evidences that gaps may be common in ncRNA 
homologs [18]. Considering local structural alignment 
with gap model seems to be more appropriate for pre- 
dicting new members for some ncRNA families. [9] con- 
sider some restricted cases of local alignment according 
to the query structure. Another work that also consider 
local alignment is [11], but they cannot handle gaps. 

We consider the following problem. Given a query 
sequence together with its secondary structure, we try 
to identify the substring in the given target sequence 
(with unknown secondary structure) that can align to a 
substring in the query sequence with the highest struc- 
tural similarity score based on the affine gap model (see 
next section for formal definitions). We assume that the 
secondary structures of the ncRNAs are regular, that is, 
they do not have pseudoknots (no two base pairs cross- 
ing each other). This type of ncRNAs is found to be the 
most abundant in existing databases. We consider all 
possible substrings of the query sequence, even for 
those substrings that cover only one of the end points 
of some base pairs in the structure. 

Our result 

We propose a local structural alignment algorithm with 
affine gap model which assumes the secondary structure 
of the query is known while that of the target sequence 
is unknown. The time complexity of our algorithm is O 
(mn 3 ) which is the same as the best algorithm for global 
alignment for this problem where m, n are the lengths 
of the query and the target, respectively. We evaluated 
our algorithm using real data from Rfam database. 
According to the preliminary experiment, it shows that 
there are ncRNA families in which considering local 
structural alignment algorithm with affine gap model 
can distinguish real members from false hits more 




Figure 1 Long gap may exist in conserved local region. Multiple sequence alignment of some seed members of the family RF01051 from 
Rfam 9.1 database. The red and blue highlighted are the base-pair regions. All sequences are aligned according to their structures. If the two 
circled sequences are selected as query and target, the circled region is the conserved local region between them, in which there exists long 
gap inside. 
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effectively than using global alignment or local align- 
ment without affine gap model. 

Preliminaries 

An ncRNA molecule can be regarded as a sequence of 
four characters {A, C, G, U}, each character is referred 
as a base. Some of these bases may form pairs (linked 
up by a hydrogen bond) with some restrictions such as 
each base can only pair up with at most one other base 
and only complementary bases can form a pair (e.g. {A, 
U), (C, G), (U,G)). The set of base pairs formed by the 
molecule is referred as its secondary structure. 

Formally speaking, let S = Sis 2 ••• s m be a length-w 
ncRNA sequence where s t e {A, C, G, U] for 1 < i < m 
and M be the secondary structure of S. M is represented 
as a set of base pair positions, i.e. M = {(/, j) | 1 < i <j < m, 
(s ; , Sj) is a base pair}. If (s„ Sj) is base pair, then (s it Sj) e 
{{A, U), (C, G), (G C), (G, ty, (U, A), (U, G)}. Let M x _ y £ 
M be the set of base pairs within the subsequence s x s x 
+1 ...s y , 1 < x <y < m, i.e., M x _ y = {(i, j) e M\x < i <j < y}. 
Note that if (i, j) e M and only i or / inside the region 
[#...yi, then (i, j) £ M Xiy . We assume that there is no two 
base pairs sharing the same position, i.e., for any (i v 
(i 2 , j 2 ) e M, h * j 2 , h * ju and z'i = i 2 if and only if j 1 = j 2 . 

A regular structure is the structure in which there 
does not exist any two base pairs crossing each other. 
The formal definition is as follows: 

Definition lM x>y is a regular structure if there does 
not exist two base pairs (i, j), (k, I) e M xy such that i <k 
<j <lor k <i <l <j. 

Note that an empty set is also considered as a regular 
structure. 

Problem definition 

Structural alignment with affine gap model 

Let S[l...m] be a query sequence with known secondary 
structure M, and T[l...n] be a target sequence with 
unknown secondary structure. 5 and T are both 
sequences of {A,C,G,U}. A structural alignment between 
S and T is a pair of sequences S'[l...r] and T'[l...r] 
where r > m, n, S' is obtained from 5 and T is obtained 
from T with spaces inserted to make both of the same 
length. A space cannot appear in the same position of S' 
and T. A maximal consecutive set of i spaces in either 
S' or T is referred as a gap of length £. The score of the 
alignment (with affine gap penalty model), which deter- 
mines the sequence and structural similarity between S' 
and T, is defined as score = 

l<i<r s.t. s'[i],T'|i]*'_' i,i s-t. ri(iVi(;)eM, 

where rj(«) is the corresponding position in S accord- 
ing to the position i in S'; y{u x ,u 2 ) and S{u lt u 2 , v lt v 2 ) 



where U\, u 2 , v 1; v 2 e {A, C, G, U}, are scores for char- 
acter similarity and for base pair similarity respectively; 
k and / is the number of gaps and the total length of all 
gaps; h and s is the gap starting and extending penalty. 

Definition 2 An optimal global structural alignment 
between S and T is a structural alignment of S and T 
such that the alignment score is maximum. 

Let S[x...y] where 1 < x, y < m be a substring of S 
with secondary structure M x-y (where S[x...y] is an 
empty string with empty structure if x >y). Similarly, let 
T[x'...y'] where 1 < x', y' < n be a substring of T (where 
T[x'...y'] is an empty string if x' >y'). 

Definition 3 An optimal local structural alignment 
between S and T is a global structural alignment 
between two sub stings of S and T, S[x...y] and T[x'...y'] 
where 1 < x, y < mand 1 < x', y' < n of S and T such 
that the alignment score between them is maximum over 
all possible substrings. 

Given S (with known secondary structure) and T (with 
unknown structure), we want to compute an optimal 
local structural alignment with affine gap penalty 
between S and T. 

Results and discussion 

The details of the algorithm for solving the problem will 
be given in Method Section. In this section, we evaluate 
the resulting algorithm and show that considering local 
structural alignment with affine gap model can improve 
the effectiveness of locating ncRNAs for the families in 
which members may have variable size of hairpins, loops 
or stems when compared to using global alignment [10], 
local alignment without gap penalty model and Gotohscan 
[8]. Note that the differences in size of hairpins, loops or 
stems represent gaps in the corresponding sequences. 

To test the algorithm, we selected around twenty 
ncRNA families in which the members have variable 
sizes of hairpins, loops or stems. We construct our test- 
ing cases based on real ncRNAs as follows. For each 
family, we first select a seed member (i.e. In Rfam data- 
base, there is a set of reliable members which are 
regarded as seed members) as the query sequence Q. To 
demonstrate the effectiveness of the affine gap model, 
we select the longest seed member as this query 
sequence. We then created a long random sequence 
with even distribution of four characters {A, C, G, J) to 
simulate a long genome. The length of this long random 
sequence is around ten times of the total length of all 
the seed members of the family. Finally, we embedded 
the full members (i.e. all the members including the 
seed members) of the family (except the one chosen as 
query) into this long random sequence in arbitrary posi- 
tions. If there are more than 100 members, then we ran- 
domly picked 100 of them. This resulting sequence is 
our T. 
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Let / be the the maximum length of all the members 
of the family. For every region in T with length 1+20, we 
compute the structural alignment score of the region 
and the query sequence. We use the same scoring 
scheme as in [9] and set the gap starting penalty (h) and 
gap extension penalty (s) to be 5 and 0.1, respectively. 
The details of the families including the sequence 
selected as the query, the length of the sequence, and 
the number of members embedded in each family are 
given in Table 1. 

We compare our algorithm with the global structural 
alignment [10], local structural alignment without affine 
gap model and Gotohscan [8]. Gotohscan was used to 
locate ncRNAs candidates on Trichoplax adhaerens by 
using single real ncRNA as query. It was designed to 
check only sequence similarity with affine gap model. 
Since the global structural alignment software is not 
available, we implemented both global and local without 
affine gap algorithms. For Gotohscan, we downloaded 
the version 1.3 from the website. We assume that 
regions other than the members of the family are false 
hits as they are likely not to be members of the family. 
To compute the effectiveness of our method, we use the 



Table 1 The details of the ncRNA families used in the 
experiments. 

Family Query Sequence ID Length Number of members 

embedded 



RF00014 


CP000468.1/2032552- 
2032638 


87 


96 


RF00021 


CP000851. 1/1 13395- 
113522 


128 


100 


RF00022 


AAN D0 100002 1.1/495- 

707 


213 


100 


RF00027 


AAPE01 289140.1/8905- 
8994 


90 


100 


RF00032 


S491 18.1/1081-1 106 


26 


100 


RF00033 


Y1 5844.1/450-543 


94 


100 


RF00034 


BX571867.1/288515- 
288628 


114 


100 


RF00038 


AJ 132964. 1/66- 198 


133 


100 


RF00039 


AF370716.1/3603-3656 


54 


100 


RF00042 


X55895. 1/474-565 


92 


100 


RF00043 


Z4741 0.1/1 220-1 294 


75 


21 


RF00044 


M1 1813.1/4883-5126 


244 


8 


RF00046 


AY0 13245.2/62208- 
62303 


96 


76 


RF00048 


AF504534. 1/666-726 


61 


100 


RF00386 


AF363455.1/1-122 


122 


100 


RF00643 


AASG02000279.1/67999- 
67862 


138 


100 


RF00661 


AC1 54049.1/4734-4855 


122 


100 


RF01051 


AE014299.1/1 112481- 


94 


100 



1 1 1 2574 



smallest threshold with no false positive and the thresh- 
olds of allowing 5% or 10% false positive rate. We 
assume that the method finds a real hit if the score of 
the region is larger than this threshold. Thus a real hit 
will be missed if the computed score is smaller than or 
equal to this threshold. Table 2 and Table 3 summarize 
the result when using different algorithms to locate the 
other ncRNA members along the genome. When the 
smallest threshold with no false positive is used, the 
average percentage of misses when using Gotohscan is 
53.9%; global alignment is 18.4%; local alignment with- 
out affine gap model is 17.9%; local alignment with 
affine gap model is 7.2%. When the threshold of allow- 
ing 5% or 10% false positive rate is used, the results 
show that the local structural alignment algorithm with 
affine gap model also works satisfactory except for the 
family RF00033. Table 4 also lists the area under the 
ROC curve when considering the false positive rate < 
10%. Note that the area is normalized to the range 
between 0 and 1. 

We also use RF00661 as an example and show the 
score distribution between the real hits and the false 
hits when using different algorithms in Figure 2. As one 
can see, the local structural alignment algorithm with 
affine gap penalty can increase the difference between 
the scores of real hits and the scores of false hits com- 
pared with the other methods, and so it has a higher 
distinguishing power to identify the real ncRNA mem- 
bers along the long genome sequence for these families. 

Our program take around 15 seconds for performing 
local structural alignment with affine gap model 
between query and target of around 150 bases long, and 
around 30 seconds for 200 bases long. We tested the 
program on a machine with 2.4GHz dual-core CPU and 
8G memory. 

Conclusions 

In the paper, we provided an algorithm to handle local 
structural alignment with affine gap model of RNA with 
regular structure that compute the optimal alignment. 
Our experiments show that the solution is effective for 
some ncRNA families in which members may have vary- 
ing sizes on hairpins, loops or stems (contributing to 
large gaps) when compared to using only global align- 
ment or local alignment without gap model. And also 
we have not yet studied different types of gap penalty 
model and the effect of setting different gap penalty 
parameters. Other interesting directions include speed- 
ing up the algorithm and considering other more com- 
plicated structures (e.g. the structures with 
pseudoknots). In the mean time, we have completed the 
algorithm of computing local structural alignment for 
simple pseudoknots structure, and we are now in the 
progress of performing experiments. 
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Table 2 Summary of comparison on results between global alignment, local alignment without gap penalty and local 


alignment with affine gap penalty when using the smallest threshold such that there is no false positive. 




Family Number of members 












Number of misses 












Gotohscan [8] % 




Global [10] 


% 


Loca 


% 


Local with affine gap % 


RF00014 


96 


Z 


1 1 OA 
Z. 1 70 




u 




0% 


0 


0% 


U 


U70 


RF00021 


100 


1 U 


1 cioa 
I Uto 




5 




5% 


5 


5% 


z 


Z70 


RF00022 


100 




jyyo 




zu 




20% 


19 


19% 


A 
4 


A OA 
470 


RF00027 


100 


1 nn 
I uu 


1 nnoA 

I UU70 




1 c 




15% 


9 


9% 


Z 


Z70 


RF00032 

1 \ 1 \JVfVf~JZ- 


100 


RO 

jy 


^QOA 




A 
/ \ 




4% 


1 


1% 


U 


noA 

U70 


RF00033 


100 


zy 


OQOA 
ZyTO 




Z/ 




27% 


27 


27% 


ZD 


l^OA 
ZJ70 


RF00034 


100 


71 
/ 1 


71 OA 
/ I 70 




1 1 




1 1% 


22 


22% 


7 


70A 
/70 


RF00038 


100 


88 


88% 




o 




0% 


0 


0% 


0 


0% 


RF00039 


100 


100 


100% 




1 




1% 


1 


1% 


1 


1% 


RF00042 


100 


10 


10% 




0 




0% 


0 


0% 


0 


0% 


RF00043 


21 


3 


143% 




0 




0% 


0 


0% 


0 


0% 


RF00044 


8 


1 


1 2.5% 




0 




0% 


0 


0% 


0 


0% 


RF00046 


76 


9 


1 1 .8% 




2 




2.6% 


1 


1.3% 


0 


0% 


RF00048 


100 


17 


17% 




0 




0% 


0 


0% 


0 


0% 


RF00386 


100 


88 


88% 




63 




63% 


62 


62% 


6 


6% 


RF00643 


100 


98 


98% 




4 




4% 


13 


13% 


0 


0% 


RF00661 


100 


100 


100% 




87 




87% 


77 


77% 


30 


30% 


RF01051 


100 


100 


1 00% 




91 




91% 


85 


85% 


52 


52% 






average 


53.9% 








18.4% 




1 7.9% 




7.2% 


Methods 












sequence with known structure M and T[l.. 


.«] be the 


We develop a 


dynamic prog 


jamming 


algorithm to solve 


tarj 


$et sequence with unknown structure. 




the problem. 


Before we describe the method, we 


would 


Definition 4 Optimal prefix- 


global structural alignment 


like to define 


some variations of alignments which will 


between S[l„ 


.m] and T[l...n] 


is to find a prefix S[l...y\ 


be used in our algorithm. 


Let S[l. 


..m] be the 


query 


where 0 < y < 


m(i.e. S is an empty string when y = 0) such 


Table 3 Summary of comparison on results between global alignment, local alignment without gap penalty and local 


alignment with affine gap penalty when setting the threshold which allows 5% or 10% of false positives. 




Family Number of members 












Number of misses 














False positive rate=5% 








False positive rate=10% 








Gotohscan 


Global Loca 


Local with affine gap 


Gotohscan Global Local Local with affine gap 


RF00014 


96 


2 


0 


0 




0 




2 


0 


0 


0 


RF00021 


100 


10 


1 


1 




1 




10 


1 


1 


1 


RF00022 


100 


51 


9 


5 




2 




35 


4 


4 


2 


RF00027 


100 


100 


3 


5 




0 




100 


2 


2 


0 


RF00032 


100 


59 


0 


0 




0 




37 


0 


0 


0 


RF00033 


100 


27 


1 


25 




24 




26 


1 


1 


24 


RF00034 


100 


71 


1 


0 




0 




71 


1 


0 


0 


RF00038 


100 


88 


0 


0 




0 




88 


0 


0 


0 


RF00039 


100 


100 


0 


0 




0 




100 


0 


0 


0 


RF00042 


100 


10 


0 


0 




0 




10 


0 


0 


0 


RF00043 


21 


3 


0 


0 




0 




3 


0 


0 


0 


RF00044 


8 


1 


0 


0 




0 




1 


0 


0 


0 


RF00046 


76 


9 


0 


0 




0 




9 


0 


0 


0 


RF00048 


100 


11 


0 


0 




0 




11 


0 


0 


0 


RF00386 


100 


88 


58 


56 




1 




88 


48 


38 


1 


RF00643 


100 


98 


1 


4 




0 




98 


0 


2 


0 


RF00661 


100 


100 


87 


66 




23 




100 


81 


52 


11 


RF01051 


100 


100 


79 


85 




47 




100 


79 


81 


39 
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Table 4 Summary of the area (normalized) under ROC 
curve for false positive rate < 10% 

Family Area (normalized) under ROC curve 





Gotohscan 


r;inh^l 

olUUdl 


Loca 


Local with affine 


RF00014 


0.98 


1 .0 


1.0 


1.0 


nrAnm 1 


0.9 


0.99 


0.99 


0.99 


KhUOOzz 


0.53 


0.92 


0.93 


0.98 


RF00027 


0.0 


0.96 


0.96 


1.0 


RhU0032 


0.61 


0.99 


1.0 


1.0 


RF00033 


0.73 


0.93 


0.79 


0.76 


RhUU034 


0.29 


0.98 


0.99 


0.99 


RF00038 


0.12 


1.0 


1.0 


1.0 


RF00039 


0.0 


1.0 


1.0 


1.0 


RF00042 


0.9 


1.0 


1.0 


1.0 


RF00043 


0.86 


1.0 


1.0 


1.0 


RF00044 


0.88 


1.0 


1.0 


1.0 


RF00046 


0.88 


1.0 


1.0 


1.0 


RF00048 


0.89 


1.0 


1.0 


1.0 


RF00386 


0.12 


0.42 


0.49 


0.98 


RF00643 


0.02 


0.99 


0.96 


1.0 


RF00661 


0.0 


0.14 


0.36 


0.79 


RF01051 


0.0 


0.18 


0.17 


0.56 



that the score of the optimal global structural alignment 
between the prefix S\\...y\ and T\\,...n\ is maximum. 

Definition 5 Optimal suffix-global structural align- 
ment between S[l...m] and T[l...n] is to find S[x...m] 
where 1 < x < m + 1 (i.e. S is an empty string when x = 
m + 1) such that the score of the optimal global struc- 
tural alignment between the suffix S[x...m] and T[l...n] 
is maximum. 

Definition 6 Optimal semi-global structural alignment 
between S[l...m] and T[l...n] is to find a substring S[x... 
y] where 1 < x, y < m such that the score of the optimal 
global structural alignment between the substring S[x...y] 
and T[l...n] is maximum. 

Let the affine gap model be h + sL, where h is the gap 
opening penalty, s represents a gap extension penalty, 
and L denotes the length of gap. Our method consists 
of two steps. In the first step, we compute the optimal 
semi-global structural alignment between S and all pos- 
sible substrings of T. In the second step, we obtain the 
optimal local structural alignment between 5 and T 
resulted in the first step. Define A(p, q, e, f) to be the 
score of the optimal semi-global structural alignment 
between S[p...q] and T [e..f\. The score of the optimal 
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Figure 2 Score distribution between the real hits and the false hits when using different algorithms for the family RF00661. The figure 
shows the comparison on score distribution of real hits (i.e. real members) and false hits for the family RF00661 between different algorithms. It 
shows that the local structural alignment algorithm with affine gap penalty can increase the difference between the scores of real hits and the 
scores of false hits compared with the other methods, and so it has a higher distinguishing power to identify the real ncRNA members along 
the long genome sequence. 
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local structural alignment between S and T can be 
obtained from the entry max eS f +l A{l, m, e,j). We first 
show how to compute A, then show how to use the 
structure of S to guide the computation of A without 
considering all possible combinations of p, q. 

When considering any substring S' = S[x' '.../] of S[x... 
y], there are four possible cases: (1) S' is equal to S (i.e. 
x' - x, y' - y); (2) S' is a proper prefix in S (i.e. x' = x, y' 
<y); (3) S' is a proper suffix in S (i.e. x' >x, y' = y); (4) S' 
is a substring of S[x + l...y - 1] (i.e. x' >x, y' <y); There- 
fore, we can consider each case one by one when com- 
puting the value of A. 

Define A^ip, q, e,f) to be the score of the optimal global 
structural alignment between S[p...q] and T[e...J\. Define A 2 
(p, q, e,f) to be the score of the optimal prefix-global struc- 
tural alignment between S[p...q - 1] and T[e...f\. Define A 3 
(p, q, e,f) to be the score of the optimal suffix-global struc- 
tural alignment between S[p + l...q] and T[e...f\. Define A 4 
(p, q, e,f) to be the score of the optimal semi-global struc- 
tural alignment between S[p + l...q - 1] and T[e...fi. 

The value of A(p, q, e,f) can be computed recursively 
and it is the maximum value of four cases: (1) when S' 
= S[p, q] (i.e. A x {p, q, e,j)); (2) when 5' is a proper prefix 
of S[p, q] (i.e. A 2 (p, q, e,f)); (3) when 5' is a proper suf- 
fix of S[p, q] (i.e. A 3 (p, q, e,f); (4) when 5' is a substring 
of S[p + 1, q - 1] (i.e. A 4 (p, q, e, f); Lemma 1 sum- 
marizes these cases. 

Lemma 1 



A(p,q,e, f) = max 



The following subsections describe how to compute 

AiAiAsAi- 

Calculation of A, 

When considering the optimal global structural align- 
ment (with affine gap model) between S[p...q] and T 
[e...f], there are nine possible cases: (1) S[p] is aligned 
with T[e] and S[q] with T[f\; (2) S[p] with T[e] and S[q] 
with space;(3) S[p] with T[e] and T[f]withspace; (4) S[p] 
with space and S[q] with T\f\; (5) S[p] with space and S 
[q] with space; (6) S[p] with space and T[f] with space; 
(7) T[e] with space and S[q] with T[f\; (8) T[e] with 
space and S[q] with space; (9) T[e] with space and T[f] 
with space. Hence, we can consider each case one by 
one when computing the value of A^ 

Define A lx (p, q, e,j), where 1 < x < 9, to be the score 
of the optimal global structural alignment between S[p... 
q] and T[e...f\ where the above case x is satisfied, (i.e. if 
x - 1, then S[p] is aligned with T[e] and S[q] with T\f\). 



The value of A t {p, q, e,f) can be computed recursively 
and it is the maximum value of nine cases. Lemma 2 
summarizes these cases. 

Lemma 2 

[ A„(p, q, e, {), A l2 (p, q, e, {), A 13 (p, q, e, /), 
MP. q, e, f) = max \ A 14 (p, q, e, /), A 15 (p, q, e, /), A 16 (p, q, e, /), 
[ A 17 (p, q, e, f), A 18 (p, q, e, f), A 19 (p, q, e, f), 

We will describe the calculation of A 12 . Similar skill 
can be applied for the others (i.e. A n , A 13 , ... , A 19 ). 
Calculation of A 12 

A\ 2 (p, q, e, f) is the score of the optimal global struc- 
tural alignment between S[p...q] and T[e...f\, which 
aligns S[p] with T[e] and S[q] with space. There are 
three situations and we need to consider them one by 
one. Note that according to the affine gap model, the 
penalty of a first space in a gap (i.e. which is h + s) is 
different from the penalty of the other space in a gap 
(i.e. which is s). Situation I: when (p, q) is a base pair 

- aligning the base pair S[p] with T[e] and S[q] with 
space. Considering the alignment between S[p + l...q 

- 1] and T[e + I...J], if S[q - 1] is aligned with space 
(i.e. case 2, case 5 and case 8), then a penalty s should 
be considered. Otherwise (i.e. for the other six cases), 
a penalty h + s should be considered. Situation II: 
when 3q' where p <q' <q such that {p, q') is a base 
pair - we need to find k e [e - l,f] such that the sum 
of the alignment score between S[p, q'] and T[e, k], 
and that between S[q' + 1, q] and T[k + 1, f] is maxi- 
mum. Since S[p] is aligned with T[e] and S[q] with 
space, the alignment between S[p,q'] and T[e, k] 
should satisfy the case 1, case 2 and case 3 (i.e. S[p] is 
aligned with T[e]). Similarly, the alignment between S 
[q' + 1, q] and T[k + 1, f] should satisfy the case 2, 
case 5 and case 8 (i.e. S[q] is aligned with space). 
Situation III: when p does not form base pair with any 
base q' e [p, q] - we align base S[p] with T[e]. Then 
the alignment between S[p + l...q] and T[e+ l...f\ 
should satisfy the case 2, case 5 and case 8 (i.e. S[q] is 
aligned with space). Lemma 3 summarizes these 
situations: 

Lemma 3 



/ /if(p,q)inM M 

max , re ii,i3,i4,i6,i7,i9{ A o [p + hq-le + l,f]-h, 

0612,15,18 

A p \p + \,q-\,e + \,f}} + y(S\p\,T[e\)-s 

1 1 if 3q' where p <q' <q such that (p, q') is a base pair 

max eSfes/ A a [p,q'e,h\ + Ap[q' + l,q,h + l,f\ 

os{ll,12,13) 
06(12,15,18} 

/ / if^q' such that (p, q') e M M 

ra»> £ (i2is,iBi Ap\p + hq.e + 1, /] + y{S[p], T[e\) 



A 12 (p,q,e, f) = max 
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Calculation of A 2 

When considering the optimal prefix-global structural 
alignment (with affine gap model) between S[p...q] and 
T[e...f\, there are four possible cases: (1) S[p] is aligned 
with T[e]; (2) S[p] with space; (3) T[f] with space; and 
(4) an empty string of S with T. 

Define A 2x (p, q, e, J), where 1 < x < 3, to be the score 
of the optimal prefix-global structural alignment 
between S[p...q] and T[e...f\ where the above case x is 
satisfied, (i.e. if x = 1, then S[p] is aligned with T[e]). 
Note that we do not need to define function for the 
case 4 because the corresponding score is - h - sif - e 
+ 1). The value of A 2 (p, q, e, f) can be computed recur- 
sively and it is the maximum value of four cases. 
Lemma 4 summarizes these cases. 

Lemma 4 

A 2 (p, q,e,f) = max{A 21 [p, q,e,f\, A 22 [p, q,e,f\, A 23 [p, 
q, e,f\, - h - s(f- e + 1)} 

We will describe the calculation of A 22 . Similar skill 
can be applied to calculate A 2l and A 23 . 
Calculation of A 22 

The following lemma lists out the computation of A 22 . 
Lemma 5 



/ 1 if ip,q)inM M 

max «E2i,23 K \P + 1. 1 ~ L e, f]-{h + s) 
max M22 A„[p + 1,4-1, e,/]-(s) 

maX aEll,12,13,17,18,19 A a\P + L 1 " 1. *. /I " i h + S ) 
maX £<El4,15,16 A a\P + 1- 1 ~ 1. 8. /I - M 

/ / if 3(f where p < c( < q such that (p, q') is a base pair 



A 22 (p, q, e, f) = max 



e~l<k<f 



A a \p,q',e,k] + A 2 [q' + l,q,k + l,f] 



a€{14,15,16} 

A 22 \p,q J ,e, f] 

I j if ^q' such that (p, q') g M p ^ 
max «E{2i,23} A « [P + 1. <?- <?, /!-('■ + s ) 
m aXaE{22) A a \p + l,q,e,f]-s 



A 22 (p, q, e,f) is the score of the optimal prefix-global 
structural alignment between S[p...q - 1] and T[e...f\, 
where S[p] is aligned with space. Similar to A l2 , there 
are also the same three situations. Situation I: when (p, 
q) is a base pair - aligning the base pair S[p] with space. 
Since a prefix of S[p...q - 1] is considered, there are 
two possibilities: a prefix of S[p + l...q - 1] is aligned 
with T[e...f\ (i.e. semi-global alignment), or the whole 
sequence S[p+ l...q - 1] is aligned with T[e...f\ (i.e. glo- 
bal alignment). Situation II: when 3q' where p <q' <q 
such that (p, q') is a base pair - we need to find k e [e 

- 1, f\ such that the sum of the alignment score 
between S[p, q'] and T[e, k], and that between S[q' + 1, 
q] and T[k + l,f\ is maximum. Since a prefix of S[p...q 

- 1] is considered, there are two possibilities: (1) the 
whole sequence S[p, q'] is aligned with T[e, k] (i.e. global 
alignment) and a prefix of S[q' + 1, q] is aligned with T 
[k + l,f\ (i.e. semi-global); (2) a prefix of S[p, q'] is 
aligned with T[e, k] (i.e. semi-global) only. Situation III: 



when p does not form base pair with any base q' e \p, 
q] - we align base S[p] with space. For each possibility 
of situation I & III, there are also two conditions: if S[p 
+ 1] is aligned with T[e] or T[e] is aligned with space, 
the penalty score h + s should be considered. Otherwise, 
if S[p + 1] is aligned with space, then the penalty score s 
should be considered. The lemma 5 summarizes these 
cases. 

The calculations for A 3 and A 4 are similar. In the fol- 
lowing subsection, we will describe the time complexity 
of the algorithm. 

Time complexity 

To fill the dynamic programming table, not all entries 
for all possible subrange of S needs to be filled. Accord- 
ing to the design of the dynamic programming, there 
are three cases: 

Case 1: if (p, q) e M M , then all the entries for S[p, q] 
of all tables (i.e. A, A lt A 2 , A 3 , A 4 , An, etc.) can be 
computed from the entries for S[p - 1, q + 1]. 

Case 2: if 3q' <q s.t. (p, q') e M pq , then all the entries 
for S[p, q] of all tables can be computed from the 
entries for S[p, q'] and S[q' + 1, q\. 

Case 3: if iq' s.t. (p, q') e M M , then all the entries for 
S[p, q] of all tables can be computed from the entries 
for S[p + 1, q]. 

Therefore, we define a function C(p> q) to determine 
for which set of subregions in S, we need to fill the cor- 
responding entires in all the tables. 



{(p + 1, q- 1)} if (p,q)eM M 
{ [p + 1, q'), (<?' + 1, q) } if 3q' < q s.t. (p, <?') e M 
{(p + !,</)} iW s.t. (p,q')eM M 



p.q 



We only need to fill in the entries for all the tables 
provided (p, q) can be obtained from (1, m) by applying 
C function repeatedly. Considering the C function, each 
time the total size of the subregions outputted cannot 
be greater than the size of the input region and each of 
the subregions outputted is smaller than the input 
region. Therefore, in total there are only 0(m) such (p, 
q) values. Also, there are 0(« 2 ) values of different (e,f) 
values, and for each entry, it takes 0{n) because of the 
consideration ofe — l<£</in the case that 3q' <q s.t. 
(p, q') e M M . After finishing the calculation of values A 
(1, m, e, f) for all 1 < e, / < «, the final answer (i.e. 
max e <y +1 {j4(l, m, e,f)}) can be computed in 0(n 2 ) time. 
Therefore the total time complexity = 0{mn 3 ) + 0(n 2 ) = 
0{mn). 

Theorem lFor any sequence S[l..m] with regular 
structure and any sequence T[l...n\ with unknown struc- 
ture, the optimal local alignment score between S[l..m] 
and T[l..n] can be computed in 0(mn 3 ). 
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